1 Intro

Our team was assigned to investigate 2019 ride data from New York City’s Citibike system.

2 Setup

2.2 Load Data

## spec_tbl_df [20,551,697 × 31] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ tripduration           : num [1:20551697] 320 316 591 2719 303 ...
##  $ starttime              : POSIXct[1:20551697], format: "2019-01-01 00:01:47" "2019-01-01 00:04:43" ...
##  $ stoptime               : POSIXct[1:20551697], format: "2019-01-01 00:07:07" "2019-01-01 00:10:00" ...
##  $ start_station_id       : Factor w/ 936 levels "72","79","82",..: 392 256 403 245 29 733 755 67 540 181 ...
##  $ start_station_name     : Factor w/ 939 levels "1 Ave & E 110 St",..: 264 702 157 2 515 482 62 193 292 429 ...
##  $ start_station_latitude : num [1:20551697] 40.8 40.8 40.8 40.7 40.7 ...
##  $ start_station_longitude: num [1:20551697] -74 -74 -74 -74 -74 ...
##  $ end_station_id         : Factor w/ 973 levels "72","79","82",..: 464 256 387 808 245 672 469 535 573 778 ...
##  $ end_station_name       : Factor w/ 977 levels "1 Ave & E 110 St",..: 931 392 440 865 380 610 453 508 785 375 ...
##  $ end_station_latitude   : num [1:20551697] 40.8 40.7 40.8 40.7 40.7 ...
##  $ end_station_longitude  : num [1:20551697] -74 -74 -74 -74 -74 ...
##  $ bikeid                 : num [1:20551697] 15839 32723 27451 21579 35379 ...
##  $ usertype               : Factor w/ 2 levels "Customer","Subscriber": 2 2 2 2 2 2 2 2 2 2 ...
##  $ birth_year             : num [1:20551697] 1971 1964 1987 1990 1979 ...
##  $ gender                 : Factor w/ 3 levels "female","male",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ age                    : num [1:20551697] 50 57 34 31 42 32 34 40 31 34 ...
##  $ starthour              : num [1:20551697] 0 0 0 0 0 0 0 0 0 0 ...
##  $ day                    : Date[1:20551697], format: "2019-01-01" "2019-01-01" ...
##  $ month                  : num [1:20551697] 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_weekday            : num [1:20551697] 3 3 3 3 3 3 3 3 3 3 ...
##  $ dayid                  : Factor w/ 2 levels "Weekday","Weekend": 1 1 1 1 1 1 1 1 1 1 ...
##  $ week_num               : num [1:20551697] 1 1 1 1 1 1 1 1 1 1 ...
##  $ distmeters             : num [1:20551697] 467 491 2050 1654 773 ...
##  $ speed                  : num [1:20551697] 1.461 1.553 3.469 0.608 2.552 ...
##  $ AWND                   : num [1:20551697] NA NA NA NA NA NA NA NA NA NA ...
##  $ PRCP                   : num [1:20551697] 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 ...
##  $ SNOW                   : num [1:20551697] 0 0 0 0 0 0 0 0 0 0 ...
##  $ SNWD                   : num [1:20551697] 0 0 0 0 0 0 0 0 0 0 ...
##  $ TAVG                   : num [1:20551697] 48.5 48.5 48.5 48.5 48.5 48.5 48.5 48.5 48.5 48.5 ...
##  $ TMAX                   : num [1:20551697] 58 58 58 58 58 58 58 58 58 58 ...
##  $ TMIN                   : num [1:20551697] 39 39 39 39 39 39 39 39 39 39 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   tripduration = col_double(),
##   ..   starttime = col_datetime(format = ""),
##   ..   stoptime = col_datetime(format = ""),
##   ..   start_station_id = col_double(),
##   ..   start_station_name = col_character(),
##   ..   start_station_latitude = col_double(),
##   ..   start_station_longitude = col_double(),
##   ..   end_station_id = col_double(),
##   ..   end_station_name = col_character(),
##   ..   end_station_latitude = col_double(),
##   ..   end_station_longitude = col_double(),
##   ..   bikeid = col_double(),
##   ..   usertype = col_character(),
##   ..   birth_year = col_double(),
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   starthour = col_double(),
##   ..   day = col_date(format = ""),
##   ..   month = col_double(),
##   ..   numWeekday = col_double(),
##   ..   dayid = col_character(),
##   ..   weekNum = col_double(),
##   ..   distmeters = col_double(),
##   ..   speed = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

We used Citibike system data from this data repository, along with weather data provided by the NWS. Most visualizations use the full data set of more than 20 million rides; examples that involve regression use a 5% sample for simplicity.
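A minimal sketch of how a 5% sample like the one described above could be drawn in base R. The `rides` table here is a small made-up stand-in for the full data set, and the seed and object names are our own assumptions rather than the code used for this report.

```r
# Hypothetical stand-in for the full ride table (the real one has ~20.5 million rows).
set.seed(2019)  # fixed seed so the sample is reproducible
rides <- data.frame(
  tripduration = round(runif(1000, 60, 3600)),  # seconds
  distmeters   = round(runif(1000, 100, 5000))  # meters
)

# Draw a 5% simple random sample of rows, without replacement.
sample_frac  <- 0.05
n_sample     <- floor(sample_frac * nrow(rides))
sample_idx   <- sample(nrow(rides), size = n_sample)
rides_sample <- rides[sample_idx, ]

nrow(rides_sample)
```

Sampling row indices rather than copying the table first keeps memory use low, which matters at the size of the full data set.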

3 Weather

3.1 Does Temperature Affect the Number of Trips that are Taken?

As we see here, male riders make up a considerably larger share of our customer base than female riders. That proportion stays fairly consistent across temperatures, which should be a point of emphasis in our marketing. More importantly, the number of trips rises significantly with temperature, so ridership will differ between the warmer and colder months. In terms of a utilization strategy, it makes sense to have fewer bikes on the street during the winter months, because the data suggests a significant correlation between temperature and the number of trips taken. We suggest increasing the number of bikes on the street as the year goes on and temperatures get warmer, and decreasing it again as winter approaches.
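The relationship between temperature and trip counts can be checked numerically as well as visually. The sketch below uses a handful of invented trips; `day` and `TAVG` follow the column names in Section 2.2, while the values and the helper column `trip` are assumptions for illustration.

```r
# Hypothetical toy rides: one row per trip, with that day's average temperature (TAVG).
rides <- data.frame(
  day  = as.Date(c("2019-01-05", "2019-01-05",
                   "2019-04-10", "2019-04-10", "2019-04-10",
                   "2019-07-20", "2019-07-20", "2019-07-20", "2019-07-20", "2019-07-20")),
  TAVG = c(30, 30, 55, 55, 55, 80, 80, 80, 80, 80)
)

# Count trips per day; TAVG is constant within a day, so it can act as a grouping key.
rides$trip <- 1
daily <- aggregate(trip ~ day + TAVG, data = rides, FUN = sum)

# Pearson correlation between daily temperature and daily trip count.
temp_trip_cor <- cor(daily$TAVG, daily$trip)
temp_trip_cor
```

A positive correlation on the real daily counts would support scaling the fleet up as temperatures rise.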

3.2 How Does Temperature Change Across the Months?

This animated chart is meant to inform management about temperature movements throughout the year, in support of the aforementioned plan to have fewer bikes on the streets when it's too hot or too cold, based on the first graph in this section of the report. Looking across the year, it would make sense to have all of our bikes on the road from May through the end of October. In the other months, we can have fewer bikes on the streets without having to worry about utilization issues, except at the stations that we discuss later in this report.

3.3 What effect does Temperature Have on the Speed of Riders?

The main point of analyzing rider speed is to know how quickly riders get from point A to point B, so that we can serve other customers who may be waiting for bikes at a specific station or looking for a station with bikes. The faster riders go, the less time they spend on bikes. That being said, we see here that speed trends downward as temperature increases. This suggests that in the summer months there are probably more leisure riders trying to see the city rather than get to point B as fast as possible. We also need to consider this relationship in the context of utilization. As stated before, we want the shortest possible wait times for our customers, with high but not overburdened utilization. Because riders are slower when it is warmer, we should expect higher capacity utilization in the warmer months.
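A minimal sketch of how the temperature and speed trend could be estimated with a simple linear regression. The first rows of the `str()` output in Section 2.2 are consistent with `speed` being `distmeters / tripduration` (for example, 467 / 320 is about 1.46), so we assume that definition here; the data values below are invented.

```r
# Hypothetical toy trips; column names follow Section 2.2, values are made up.
rides <- data.frame(
  tripduration = c(300, 400, 500, 600, 700, 800),       # seconds
  distmeters   = c(900, 1100, 1300, 1400, 1500, 1600),  # meters
  TAVG         = c(30, 40, 50, 60, 70, 80)              # daily average temperature
)

# Assumed definition of speed: meters traveled per second of trip time.
rides$speed <- rides$distmeters / rides$tripduration

# Simple linear regression of speed on temperature.
fit   <- lm(speed ~ TAVG, data = rides)
slope <- unname(coef(fit)["TAVG"])
slope  # a negative value means riders are slower on warmer days
```

Because distance is measured between the start and end stations, a leisurely round trip registers as a very low speed, which is one reason to treat the slope as a rough indicator only.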

3.4 What Effect does Temperature Have on the Distance of Trips?

Simply put, this graph shows the relationship between distance traveled and the average temperature on any given day. The relationship is positive, meaning that as the days get warmer, we can expect most rides to increase in distance. Again, it all goes back to utilization: the farther riders travel, the longer they are on the bike, all else equal. We expect this to add to our point that utilization will most likely be near, if not at, 100% for a majority of our bike stations during the warmer months.

3.5 What Effect Does Precipitation Have on the Speed of Riders?

Obviously no one likes to be in the rain, which is why we have plotted the relationship between precipitation and the speed of our riders. On days when it rains, we see an increase in rider speed. Connecting this back to our point on utilization: although the risk of capacity problems is likely lower on these days, it may not make sense to decrease the number of bikes on the street, because there are always stations in areas like Manhattan that are overutilized compared to others east of the city.

3.6 What Effect Does Precipitation Have on the Distance of Trips?

It makes sense that riders would travel shorter distances to stay out of the rain, but overall, precipitation does not seem to have much effect on how far riders actually go. This suggests that the majority of our riders have to get where they need to go no matter the weather conditions. From this we can conclude that we are a key transportation resource for those who may not be able to afford other public transportation options or a motor vehicle. We should consider this when pricing our annual subscription.

3.7 Does Snow Impact the Number of Trips Taken?

As expected, the vast majority of trips are taken when there is little to no snow. Riders simply do not want to ride bikes in the snow. Something interesting to note here is that (at least visually) it seems that rides that are taken when there is snow are taken by subscribers as opposed to customers. This suggests that the subscribers ride more so out of necessity (to get to work, etc.) whereas customers may ride more for leisure.
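The visual impression about subscribers and customers on snowy days can be backed up with a simple cross-tabulation. This is a sketch on invented rows; `usertype` and `SNOW` follow Section 2.2, while the `snow_day` flag and its threshold (any snowfall above zero) are our own assumptions.

```r
# Hypothetical toy trips; values are made up for illustration.
rides <- data.frame(
  usertype = c("Subscriber", "Subscriber", "Customer", "Subscriber",
               "Customer", "Subscriber", "Subscriber", "Customer"),
  SNOW     = c(0, 0, 0, 0, 0, 1.2, 2.5, 0)
)

# Flag trips taken on days with any recorded snowfall (assumed threshold: SNOW > 0).
rides$snow_day <- ifelse(rides$SNOW > 0, "Snow", "No snow")

# Trip counts by snow condition and user type, then row-wise shares.
counts <- table(rides$snow_day, rides$usertype)
shares <- prop.table(counts, margin = 1)
shares
```

Comparing the subscriber share in the two rows shows whether subscribers are overrepresented on snowy days relative to ordinary days.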

3.9 What Effect Does Wind Have on the Speed of Riders?

There is a significantly greater increase in rider speed when there is precipitation as opposed to when there is wind. This could have to do with the fact that both headwind and tailwind can impact rider speed; if there is headwind, bikes will be slowed down, whereas if there is tailwind, bikes will travel faster. The two may essentially cancel each other out. It is important to note that this chart does not study the relationship between precipitation and wind and their combined effects on rider speed, but looking at wind alone, it seems that there is a minimal effect on speed.

3.10 What Effect Does Wind Have on the Distance of Trips?

Wind seems to have a minimal effect on how far bikers ride compared to the effect that precipitation has. This is interesting: does precipitation have such a substantial psychological impact on riders that it changes the routes they have planned? This certainly depends on the purpose of each ride, be it for work or for leisure. It is important to note that this chart does not study the relationship between precipitation and wind and their combined effects on travel distance, but looking at wind alone, it seems that there is in fact a small effect on distance.

3.11 Does Temperature Impact the Age of Riders?

Temperature has a significantly larger impact on the expected age of subscribers than customers (almost 33x). For both, there is a negative relationship between temperature and age. Subscribers likely ride more out of necessity, so older people are more likely to continue riding in the cold, explaining the larger y-intercept. Customers are less sensitive to temperature because they ride less in extreme temperatures to begin with. As temperature rises, age decreases. Younger people are more likely to ride when it’s warmer and older customers are more likely to ride when it’s colder.
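A comparison of temperature effects between user types can be made by fitting one regression per group and taking the ratio of the slopes. The sketch below shows the mechanics on invented data, so the resulting ratio is illustrative only and is not the figure quoted above.

```r
# Hypothetical toy data: ages fall with temperature in both groups,
# but much more steeply for subscribers. All values are made up.
rides <- data.frame(
  usertype = rep(c("Customer", "Subscriber"), each = 4),
  TAVG     = rep(c(30, 50, 70, 90), times = 2),
  age      = c(33.0, 32.9, 32.8, 32.7,   # customers
               44,   42,   40,   38)     # subscribers
)

# Fit age ~ TAVG separately for each user type and keep the coefficients.
fits <- lapply(split(rides, rides$usertype),
               function(d) coef(lm(age ~ TAVG, data = d)))

# Ratio of the subscriber slope to the customer slope.
slope_ratio <- unname(fits$Subscriber["TAVG"] / fits$Customer["TAVG"])
slope_ratio
```

The same comparison could be made in one model with an interaction term (`age ~ TAVG * usertype`); separate fits are used here only because they are easier to read.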

4 Rides

4.3 Does Age Affect the Speed of Riders?

A person’s age appears to have a significant impact on the speed at which they ride. From ages 18 to 30, there is a sharp increase in average speed, going from 3 to 4.5 miles per hour. However, the average speed of riders starts to gradually decline around age 35, with that rate of decline increasing with age, especially in senior citizens (65-80 year old individuals). Oddly enough, individuals in their late 70s appear to ride at nearly the same speed as people in their early 20s!

5 Bike Usage

We wanted to investigate how individual bike usage changes.

5.1 How does bike usage change over time?

5.1.1 Number of Rides per Day per bike

The above histogram depicts the number of rides per day per bike. Most bikes are ridden reasonably frequently, about 5-10 times per day.

5.1.2 Average number of rides per given bike

The above histogram depicts the distribution of average daily rides across all bikes. It seems that the vast majority of bikes are ridden about 5 times per day, but there is an interesting trimodal distribution where lots of bikes are ridden 7 times per day, and another subset of bikes are ridden between 10 and 20 times per day.
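The two quantities behind these histograms can be sketched as two aggregation steps: first count rides per bike per day, then average those counts for each bike. The rows below are invented, and the helper column `n` is an assumption; note that this version averages only over days on which a bike was actually ridden.

```r
# Hypothetical toy trips: one row per ride.
rides <- data.frame(
  bikeid = c(15839, 15839, 15839, 32723, 32723, 15839, 32723),
  day    = as.Date(c("2019-01-01", "2019-01-01", "2019-01-01",
                     "2019-01-01", "2019-01-01",
                     "2019-01-02", "2019-01-02"))
)

# Step 1: number of rides per bike per day.
rides$n <- 1
per_bike_day <- aggregate(n ~ bikeid + day, data = rides, FUN = sum)

# Step 2: average daily rides per bike, over the days the bike was active.
avg_per_bike <- aggregate(n ~ bikeid, data = per_bike_day, FUN = mean)
avg_per_bike
```

Passing `per_bike_day$n` or `avg_per_bike$n` to `hist()` would produce histograms of the two kinds described above.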

5.2 How does bike usage change month to month?

5.2.1 Number of Rides per bike per Month

The above histograms depict the distribution of the number of rides per bike in each month. It is very interesting to note that in the warmer months between May and November, the distributions are bimodal: a small subset of bikes is being ridden much more frequently than the rest. This suggests that this subset of bikes is perhaps being ridden by a different group of people. We hypothesize that this is due to higher tourism during those warmer months, and that the bikes ridden by tourists may not be mixing with other bikes.

5.3 What bikes are used the most?

5.3.1 Number of Rides per Bike

Next, we broke bike usage out across the entire year to see whether some bikes are being ridden more on a year-long basis. The above histogram shows the number of rides per bike. The vast majority of bikes are ridden about 1000 times per year, but there is a small subset of bikes that are ridden approximately 2700 times per year.

5.4 Do certain bikes make it back to the same stations?

We wanted to figure out how bikes are moving throughout the city. First we asked ourselves if bikes ever make it back to the same station they started at, which we measured as the number of times a bike ends a ride at a given station in a given time period.
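The measure described above can be sketched as a grouped count of ride endings per bike and station, computed once over the whole year and once per day. The data are invented, and treating a count above one as a "return" is our own assumption for illustration.

```r
# Hypothetical toy trips: one row per ride, with the station where it ended.
rides <- data.frame(
  bikeid         = c(1, 1, 1, 2, 2, 1),
  end_station_id = c("72", "79", "72", "72", "82", "72"),
  day            = as.Date(c("2019-01-01", "2019-01-01", "2019-01-01",
                             "2019-01-01", "2019-01-02", "2019-01-02"))
)
rides$n <- 1

# Times each bike ended a ride at each station over the whole period.
ends_year <- aggregate(n ~ bikeid + end_station_id, data = rides, FUN = sum)

# The same count, restricted to a single day at a time.
ends_day <- aggregate(n ~ bikeid + end_station_id + day, data = rides, FUN = sum)

# Assumed definition of a "return": the bike ended at that station more than once.
returns_year <- ends_year[ends_year$n > 1, ]
returns_day  <- ends_day[ends_day$n > 1, ]
```

Summing `n` by `end_station_id` in either result gives a per-station total that can be drawn on a map.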

5.4.1 Ever?

The above map depicts the stations that bikes most frequently returned to over the entire year.

5.4.2 On the same day?

The above map depicts the stations that bikes most frequently returned to on any given day.

5.5 How do the most used bikes move on their active days?

We wanted to find out how the most active bikes are moving around: are they mostly staying in the same areas and being ridden back and forth, or are they traversing long distances?

The above map depicts the journeys of the most frequently ridden bikes on any given day. As we can see, bikes tend to traverse very long distances when they are being ridden frequently, and they do not end up close to where they started.

6 Asymmetric Traffic

Next, we wanted a more general analysis of bike usage: rather than restricting it to individual bikes, we looked at how bikes in general move across the city.

6.1 First Let’s Break up New York City by Neighborhood

First, we broke out the New York City area by neighborhood using data from BetaNYC in order to gain more granular insights into the New York CitiBike system.

6.2 Which Stations Have High Levels of Departures?

Here, we have overlaid the number of departures by CitiBike station onto our neighborhood map, with neighborhoods color-coded by their total number of stations. As you can see, the stations with the highest number of departures in our data set are concentrated on the island of Manhattan, particularly in the Midtown area. The contrast between stations in the heart of Manhattan and stations in Brooklyn and Queens is stark. Many stations in Manhattan have more than 50,000 departures, whereas many in the other boroughs have been used fewer than 1,000 times. Additionally, although there are neighborhoods in southern Manhattan with far fewer stations than Midtown, these stations still experience a relatively high number of departures.

6.3 Which Stations Have Deficits and Surpluses?

This map shows each station and its corresponding deficit or surplus, defined as arrivals - departures. As is to be expected, the stations with the highest deficits are concentrated in and around the island of Manhattan, likely because a large number of commuters use the system every day to travel to their place of work. Interestingly, many of these stations with significant deficits are quite close to multiple other stations with surpluses. Because of this discrepancy, we recommend that CitiBike implement a dynamic model to incentivize riders to park at surplus stations rather than deficit stations. Uber's algorithm, which uses multiple variables like time and distance to price a given ride, is one point of comparison. CitiBike should create a similar model that offers riders ride credits, discounts or other perks to park at surplus stations. This would of course require CitiBike to track real-time deficits and surpluses at each station, as well as the level of bike inventory, in order to scale the incentives accordingly.
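The deficit or surplus defined above (arrivals - departures) can be sketched in a few lines of base R. The trips below are invented; the station columns follow the names in Section 2.2.

```r
# Hypothetical toy trips between three stations.
rides <- data.frame(
  start_station_id = c("72", "72", "72", "79", "82"),
  end_station_id   = c("79", "79", "82", "72", "79"),
  stringsAsFactors = FALSE
)

# Use a shared set of station levels so stations with zero arrivals or
# zero departures still appear in both counts.
stations   <- sort(unique(c(rides$start_station_id, rides$end_station_id)))
departures <- table(factor(rides$start_station_id, levels = stations))
arrivals   <- table(factor(rides$end_station_id,   levels = stations))

# Net flow per station: negative = deficit, positive = surplus.
net <- data.frame(
  station_id = stations,
  arrivals   = as.integer(arrivals),
  departures = as.integer(departures)
)
net$net <- net$arrivals - net$departures
net
```

Every trip adds one departure and one arrival, so the net values sum to zero across the system; a deficit at one station is always matched by surpluses elsewhere.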

6.4 Which Stations Are Being Utilized Most Heavily and Least Heavily?

For this visualization, we created a rough measure of each station's bike utilization using the data set and our own assumptions. First, we assumed an average number of bikes at each station, given the total number of bikes and stations in the data. Then, we calculated the number of 'turns' of this average based on the number of departures at each station. We used departure counts rather than including arrivals because we wanted a measure of how many times bikes at a given station were taken and actually used. This map shows the top (green) and bottom (red) deciles of stations by bike utilization. As you can see, virtually all of the stations in the top 10% of bike utilization (thousands of turns) are in Manhattan, whereas stations in the bottom 10% (less than 1 turn) are mostly in Brooklyn. Based on this sample, we recommend that CitiBike relocate some of the least used stations from Brooklyn to Manhattan. This will balance the wear and tear that bikes receive and reduce the number of stations in Manhattan that have particularly high bike deficits. Using this tactic in addition to the incentive model discussed earlier could significantly smooth demand, inventory and usage.
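The 'turns' measure and the decile cutoffs described above can be sketched as follows. The departure counts, station labels, and fleet size are invented for illustration, and `quantile()` is used with its default interpolation, which may differ from the exact cutoffs behind the map.

```r
# Hypothetical departures for ten stations (labels S1..S10 are made up).
departures <- c(5, 120, 800, 1500, 4000, 9000, 15000, 30000, 52000, 61000)
names(departures) <- paste0("S", 1:10)

# Assumed fleet size; the average bikes per station is total bikes / total stations.
total_bikes <- 200
avg_bikes   <- total_bikes / length(departures)

# Turns: how many times the average station inventory was taken out.
turns <- departures / avg_bikes

# Top and bottom deciles of stations by turns.
cuts   <- quantile(turns, probs = c(0.1, 0.9))
bottom <- names(turns)[turns <= cuts[1]]
top    <- names(turns)[turns >= cuts[2]]
```

Because every station is assigned the same average inventory, turns are proportional to departures, so the ranking is identical to ranking by departures; the measure mainly puts usage on a more interpretable scale.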

6.5 Which Stations are in the Top and Bottom Deciles of Deficit/Surplus?

6.6 Table and Map of the Best and Worst 5 Stations for Deficits

These maps and table show the top and bottom deciles of station deficits as well as the 5 best and 5 worst offenders. We recommend that CitiBike use these visualizations in accordance with our earlier recommendations around creating incentives and relocating stations. The best and worst stations above can serve as ‘beta tests’ for these recommendations. For example, CitiBike could move a number of the least-utilized stations from Brooklyn to sites adjacent to the Manhattan stations with the highest deficits, while testing the incentive model to measure the effect.

7 Conclusion

We believe our above visualizations help illustrate Citibike’s ridership patterns and riders’ habits. As previously stated, we recommend that Citibike remove some stations in Brooklyn and install more stations in Manhattan. As depicted above, station turnover is much higher in Manhattan than in Brooklyn. Areas of Brooklyn where turnover is particularly low but station networks are dense are underutilized despite the availability of bikes and docks, and Citibike is wasting resources and maintenance costs by keeping bikes and stations in these areas. Meanwhile, areas of Manhattan with high ridership and high turnover should receive more stations to meet the demand. As we’ve illustrated, demand in Manhattan is higher for both tourists and commuters, so bikes are better utilized in these areas. We believe Citibike can see increased utilization and profitability by following this recommendation.

8 Contributors

Justin Applefield, Kenny Andrysiak, Evan Lipchin, Jacob Cohen, and Ian Cooper all contributed to this project.